Welcome back to Pattern Recognition. Today we want to continue thinking about discriminative modeling and feature transforms. We had this idea of performing an essentially class-wise normalization in our feature space, and if we do so, certain properties emerge in this feature space that we want to have a look at today.
This is essentially the pathway towards linear discriminant analysis. Our training data now consists not just of feature vectors but also of the associated class labels, and with the class labels we can apply the transform we already talked about. So let's see how we can use this to find a transform that takes the different class distributions into account.

The first thing we need is the joint covariance matrix. Here we look at all of the observations and compute a single covariance matrix over the entire training set, but we respect the class membership by normalizing with different means: we first compute the mean of each class, and then we compute one joint covariance over all feature vectors, where every observation is compared to its respective class mean. This gives us the joint covariance matrix Σ̂.
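As a minimal sketch of this step (my own illustration, not code from the lecture): assuming the training data is given as a NumPy array X of shape (N, d) with integer labels y in 0..K−1, the class-wise means and the joint covariance could be estimated like this.

```python
import numpy as np

def class_means_and_pooled_cov(X, y):
    """Class-wise means and the joint covariance of class-centered features."""
    K = int(y.max()) + 1
    # mean vector of each class, stacked into a (K, d) array
    means = np.stack([X[y == c].mean(axis=0) for c in range(K)])
    # center every observation with the mean of its own class
    centered = X - means[y]
    # one covariance matrix over the whole (class-centered) training set
    sigma_hat = centered.T @ centered / X.shape[0]
    return means, sigma_hat
```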
Now we can use our familiar trick: Σ̂ can be decomposed into U D Uᵀ, and this allows us to define a transform φ that maps into a normalized space, namely φ(x) = D^(−1/2) Uᵀ x. If we apply this transform to the class means as well, we obtain the normalized means μ′_y, which are simply the class means under the transform, μ′_y = φ(μ_y). So now we have the feature transform φ and the transformed mean vectors μ′_y.
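Continuing the sketch above (again an assumed illustration, with sigma_hat and means as returned by the previous function), the eigendecomposition gives us the transform φ and the transformed means:

```python
import numpy as np

def whitening_transform(sigma_hat, means):
    """phi(x) = D^(-1/2) U^T x, built from sigma_hat = U D U^T (full rank assumed)."""
    eigvals, U = np.linalg.eigh(sigma_hat)       # columns of U are eigenvectors, D = diag(eigvals)
    A = np.diag(1.0 / np.sqrt(eigvals)) @ U.T    # the matrix D^(-1/2) U^T

    def phi(x):
        # works for a single vector of shape (d,) as well as a batch of shape (N, d)
        return x @ A.T

    mu_prime = phi(means)                        # transformed class means mu'_y
    return phi, mu_prime
```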
Now let's look at the actual decision rule on this sphered data, as you could call it. Again, y* is the class that maximizes the posterior probability. We use the usual trick of decomposing the posterior into the prior and the class conditional. We already know that the class conditional is a Gaussian, so we apply the logarithm and drop all the terms that do not change the decision, i.e., everything that is independent of the class. What remains is the log prior and the exponent of the Gaussian distribution. In this normalized space the covariance is the identity matrix, so it cancels out and only the feature transform is left inside the exponent. If we look closely, the inner product can be rewritten as a squared L2 norm, so we are effectively comparing points in the normalized space by computing the Euclidean distance between the transformed feature vector φ(x) and the transformed class means μ′_y; the rule becomes y* = argmax_y ( log p(y) − ½ ‖φ(x) − μ′_y‖² ). Of course, there is still the influence of the class prior, which would cancel out in the decision rule if all classes had the same prior.
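Putting the decision rule into code form (a sketch under the same assumptions as above, with a priors array for p(y) that is not specified in the lecture):

```python
import numpy as np

def classify(x, phi, mu_prime, priors):
    """y* = argmax_y [ log p(y) - 0.5 * ||phi(x) - mu'_y||^2 ]."""
    z = phi(x)                                      # map the feature vector into the sphered space
    sq_dists = np.sum((mu_prime - z) ** 2, axis=1)  # squared Euclidean distance to every mu'_y
    scores = np.log(priors) - 0.5 * sq_dists        # log prior plus the Gaussian exponent
    return int(np.argmax(scores))
```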
So, some conclusions: if all classes share the same prior, this is nothing else than a nearest-neighbor decision rule in which the transformed mean vectors serve as prototypes, and we simply choose the class whose transformed mean is closest. Also note that the feature transform φ does not change the dimension of the features; it is just a rotation and scaling, and it is steered by the global covariance.
Now let's think about whether this makes sense. I have a geometric argument here; let's consider the case of two classes. After the transform we are in the sphered space, so everything here is mapped by φ: we look at φ(x), φ(μ0), and φ(μ1). Now I can connect the two class centers φ(μ0) and φ(μ1), and this gives us the connection a = φ(μ1) − φ(μ0), the vector pointing from one class center to the other. This is the information that is really relevant for deciding the class; remember that our decision boundaries in this normalized space are going to be lines (hyperplanes in higher dimensions). Given this connection between the two centers, take an arbitrary feature vector x and compute the difference between φ(x) and φ(μ0): you see that it does not matter how far φ(x) is moved parallel to the decision boundary, that is, orthogonally to a; only the component along a influences the decision.
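To make the geometric argument concrete, here is a small numerical check with made-up transformed means: shifting φ(x) orthogonally to a leaves the difference of the two squared distances, and hence the decision, unchanged.

```python
import numpy as np

# made-up transformed class means in the sphered space
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
a = mu1 - mu0                       # connection between the two class centers

def score_difference(z):
    """||z - mu1'||^2 - ||z - mu0'||^2; its sign decides the class (equal priors)."""
    return np.sum((z - mu1) ** 2) - np.sum((z - mu0) ** 2)

z = np.array([0.7, 1.3])            # some transformed feature vector phi(x)
v = np.array([-a[1], a[0]])         # a direction orthogonal to a

# moving z along v does not change the score difference, and thus not the decision
print(score_difference(z), score_difference(z + 3.0 * v))   # both are -0.4 up to rounding
```

Algebraically, ‖z − μ′1‖² − ‖z − μ′0‖² = −2aᵀz + const, so only the projection of z onto a enters the decision.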